Add automated visual testing with Claude vision validation#8
Conversation
- Create scripts/visual_test_overlay.py for automated screenshot capture
- Add docs/plans/visual-test-checklist.md with validation criteria
- Enables screenshot-based testing on KDE Plasma environments
- Screenshots can be reviewed with AI vision for PR-05 validation

Test workflow:
1. Run visual_test_overlay.py on graphical KDE session
2. Captures 5 test scenarios (empty, suggestions, max, cleared)
3. Share screenshots for visual review

Co-Authored-By: Claude <noreply@anthropic.com>
- Runs visual_test_overlay.py under xvfb with full graphics stack
- Uploads screenshots as downloadable artifacts
- Triggers on manual dispatch or overlay code changes
- Creates montage image for easy review
- Includes overlay unit tests as separate job

Usage:
- Go to Actions tab → Visual Tests → Run workflow
- Download screenshots artifact after completion
- Review screenshots for visual validation

Co-Authored-By: Claude <noreply@anthropic.com>
This enables fully automated visual validation of the overlay UI in CI:

- scripts/validate_screenshots.py: Sends screenshots to Claude API for vision-based validation against defined criteria
- Updated visual-tests.yml workflow to:
  - Capture screenshots under xvfb
  - Call Claude API to validate each screenshot
  - Generate pass/fail report with detailed reasoning
  - Upload artifacts for manual review
  - Show summary in GitHub Actions UI
- Updated docs with complete workflow instructions

To enable:
1. Add ANTHROPIC_API_KEY secret in repo settings
2. Run "Visual Tests" workflow from Actions tab
3. Review automated validation results

The workflow validates:
- Empty overlay state
- Demo suggestions display (2 chips)
- Single suggestion display
- Maximum 3 chips limit
- Cleared suggestions state

Co-Authored-By: Claude <noreply@anthropic.com>
Review Summary by Qodo: Add automated visual testing with Claude vision validation
Walkthrough

Description:
- Automated visual testing system for overlay UI with Claude vision validation
- Screenshot capture harness for 5 test scenarios (empty, suggestions, single, max, cleared)
- Claude API integration for automated visual validation against defined criteria
- GitHub Actions workflow with xvfb for headless screenshot capture and validation
- Comprehensive documentation with setup instructions and troubleshooting guide

Diagram:

```mermaid
flowchart LR
    A["Test Scenarios"] -->|"Capture"| B["visual_test_overlay.py"]
    B -->|"Generate"| C["Screenshots"]
    C -->|"Validate"| D["validate_screenshots.py"]
    D -->|"Send to API"| E["Claude Vision"]
    E -->|"Return Results"| F["Validation Report"]
    G["GitHub Actions"] -->|"Run"| B
    G -->|"Run"| D
    G -->|"Upload"| H["Artifacts"]
    F -->|"Display"| H
```
File Changes:
1. scripts/visual_test_overlay.py
CI Feedback 🧐 — A test triggered by this PR failed. Here is an AI-generated analysis of the failure:
Code Review Summary — Status: No Issues Found | Recommendation: Merge

Files Reviewed (4 files)
Pull request overview
Adds an automated visual-regression workflow for the PySide6 overlay UI by capturing screenshots in CI and (optionally) validating them with Claude vision, with supporting scripts and documentation.
Changes:
- Added a screenshot capture harness for the overlay across multiple UI states.
- Added a Claude-vision validator that scores screenshots against scenario-specific criteria and emits a JSON report.
- Added a GitHub Actions workflow to run screenshot capture + optional validation and publish artifacts/summaries.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| scripts/visual_test_overlay.py | Launches the overlay in fallback/demo states and captures scenario screenshots. |
| scripts/validate_screenshots.py | Sends captured screenshots to Claude and parses structured pass/fail JSON results. |
| .github/workflows/visual-tests.yml | CI workflow to run capture under Xvfb, optionally validate via Claude, and upload artifacts/summary. |
| docs/plans/visual-test-checklist.md | Setup + usage + troubleshooting documentation for the visual test workflow. |
```yaml
- name: Validate screenshots with Claude
  id: validate
  if: ${{ (inputs.run_claude_validation == true || inputs.run_claude_validation == '') && env.HAS_API_KEY == 'true' }}
  run: |
```
The validation step’s if: condition relies on inputs.run_claude_validation == '' to mean “default enabled”, but on pull_request runs inputs.* is not provided (it evaluates as null/undefined, not an empty string), so this condition can evaluate to false and skip validation even when the API key is configured. Consider making the default explicit for non-workflow_dispatch events (e.g., github.event_name != 'workflow_dispatch' || inputs.run_claude_validation) and check the secret directly (secrets.ANTHROPIC_API_KEY != '') instead of piping through env.
```python
# Test 4: Three suggestions (max)
print("\n5. Test: Maximum suggestions (3)")
three_suggestions = DEMO_SUGGESTIONS + [
    {
        "action": "tile_right",
        "key": "Meta+Right",
        "description": "Tile window to right half",
        "priority": 50,
    }
]
overlay.set_suggestions_fallback(three_suggestions)
app.processEvents()
time.sleep(0.5)
results["04_max_three"] = take_screenshot("04_max_three", screenshots_dir)
```
This “maximum suggestions” scenario only passes exactly 3 suggestions. That doesn’t actually test the max-3 truncation behavior implemented in OverlayWindow.update_suggestions (it slices to suggestions[:3]). To validate truncation, pass 4+ suggestions and ensure the 4th is not rendered (and update Claude criteria to assert it’s absent).
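A minimal sketch of what the reviewer is asking for, assuming (as the comment states) that `OverlayWindow.update_suggestions` truncates via `suggestions[:3]`; the helper function and suggestion dicts here are hypothetical stand-ins for the real overlay call:

```python
# Hypothetical sketch: a truncation test must feed MORE than the limit
# and assert the surplus item is absent from what gets rendered.
MAX_CHIPS = 3

def truncate_suggestions(suggestions):
    # Mirrors the slice-based truncation the review describes.
    return suggestions[:MAX_CHIPS]

four_suggestions = [
    {"action": "tile_left", "key": "Meta+Left"},
    {"action": "tile_right", "key": "Meta+Right"},
    {"action": "maximize", "key": "Meta+Up"},
    {"action": "minimize", "key": "Meta+Down"},  # 4th: should NOT be rendered
]

displayed = truncate_suggestions(four_suggestions)
assert len(displayed) == MAX_CHIPS
assert all(s["key"] != "Meta+Down" for s in displayed)
```

The matching Claude criteria would then assert that the fourth suggestion's text is absent from the screenshot, not merely that three chips are visible.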
```python
overlay.set_suggestions_fallback(DEMO_SUGGESTIONS)
app.processEvents()
time.sleep(0.5)
results["02_with_suggestions"] = take_screenshot("02_suggestions", screenshots_dir)
```
Result key results["02_with_suggestions"] doesn’t match the test ID/filename (02_suggestions) used elsewhere (validation criteria, screenshot name). This makes the summary output harder to correlate and can confuse consumers if results keys are later reused. Prefer using the same stable test ID as the key (e.g., 02_suggestions).
```diff
- results["02_with_suggestions"] = take_screenshot("02_suggestions", screenshots_dir)
+ results["02_suggestions"] = take_screenshot("02_suggestions", screenshots_dir)
```
```python
for png_file in screenshots_dir.glob("*.png"):
    # Extract test ID from filename like "overlay_test_01_empty_20250107_123456.png"
    name = png_file.stem
    for test_id in VALIDATION_CRITERIA:
        if test_id in name:
            screenshots[test_id] = png_file
            break
```
find_screenshots() will pick an arbitrary file when multiple screenshots for the same test ID exist (glob order is not guaranteed), which can make local runs nondeterministic if the screenshots directory contains older images. Prefer selecting the most recent match per test ID (e.g., by parsing the timestamp in the filename or using mtime) and/or erroring on duplicates.
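One way to implement the mtime-based fix the comment suggests — a sketch, not the repository's actual `find_screenshots()` (the function name and signature here are illustrative):

```python
from pathlib import Path

def find_latest_screenshots(screenshots_dir, test_ids):
    """Select the most recent screenshot per test ID so stale files
    from earlier local runs can never be picked nondeterministically."""
    latest = {}
    # Sort ascending by modification time; newer matches overwrite older ones.
    for png_file in sorted(Path(screenshots_dir).glob("*.png"),
                           key=lambda p: p.stat().st_mtime):
        for test_id in test_ids:
            if test_id in png_file.stem:
                latest[test_id] = png_file
                break
    return latest
```

Parsing the embedded timestamp from the filename would also work, but mtime avoids coupling the validator to the capture script's naming format.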
```python
# Parse response
response_text = message.content[0].text
```
Response parsing assumes the first content block is text (message.content[0].text). Anthropic responses can include multiple content blocks; if the first block isn’t text or JSON is split across blocks, this will raise/parse incorrectly. Make this more robust by concatenating all text blocks (or otherwise extracting text blocks) before JSON extraction.
```diff
- # Parse response
- response_text = message.content[0].text
+ # Parse response: concatenate all text content blocks
+ text_blocks: list[str] = []
+ for block in message.content:
+     # Support both SDK ContentBlock objects and plain dicts
+     block_type = getattr(block, "type", None) or (block.get("type") if isinstance(block, dict) else None)
+     if block_type == "text":
+         text = getattr(block, "text", None)
+         if text is None and isinstance(block, dict):
+             text = block.get("text")
+         if isinstance(text, str):
+             text_blocks.append(text)
+ if not text_blocks:
+     raise ValueError("No text content blocks found in Claude response")
+ response_text = "\n".join(text_blocks)
```
```markdown
### GitHub Actions fails to capture screenshots
- Check that `xvfb-run` is installed (should be automatic)
- Verify `QT_QPA_PLATFORM=offscreen` is set
```
Troubleshooting recommends setting QT_QPA_PLATFORM=offscreen, but the visual test workflow relies on screenshot tools capturing an X11 display. With offscreen, Qt often won’t create a visible window to capture. Update this guidance to reflect the workflow’s actual requirements (xvfb + xcb/normal platform).
```diff
- - Verify `QT_QPA_PLATFORM=offscreen` is set
+ - Verify `QT_QPA_PLATFORM` is **not** set to `offscreen` (leave it unset or set to `xcb` so Qt uses the X11 display)
```
```yaml
QT_QPA_PLATFORM: offscreen
DISPLAY: ':99'
```
In the capture step, setting QT_QPA_PLATFORM=offscreen will typically prevent Qt from creating an on-screen X11 window, so scrot/import may capture a blank/root image. For screenshot-based testing under xvfb, run with the normal xcb platform (e.g., unset QT_QPA_PLATFORM or set it to xcb) and rely on xvfb-run to provide DISPLAY.
```diff
- QT_QPA_PLATFORM: offscreen
- DISPLAY: ':99'
+ QT_QPA_PLATFORM: xcb
```
Code Review by Qodo
```python
# Call Claude API
client = anthropic.Anthropic(api_key=api_key)

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": prompt,
                },
            ],
        }
    ],
)
```
1. Claude API call unhandled 📘 Rule violation ⛯ Reliability
validate_screenshot_with_claude() calls the Anthropic API without handling network/API exceptions, so transient failures can crash the script and produce stack traces in CI logs. This reduces reliability and may expose internal error details to end users of CI output.
Agent Prompt
## Issue description
`scripts/validate_screenshots.py` calls the Anthropic API without exception handling. Network/API errors can crash the script and print stack traces in CI logs.
## Issue Context
This script is intended for CI usage; failures should degrade gracefully with actionable error messages and controlled exit codes.
## Fix Focus Areas
- scripts/validate_screenshots.py[198-224]
- scripts/validate_screenshots.py[378-384]
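One way the graceful-degradation fix could look — a standalone sketch of retry-with-exponential-backoff. In the real script `retryable` would name the Anthropic SDK's connection and rate-limit exceptions; this version uses builtin exceptions so it runs without the SDK:

```python
import random
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0,
                      retryable=(ConnectionError, TimeoutError)):
    """Retry a transient-failure-prone call with exponential backoff.

    `retryable` is a stand-in for the SDK's transient error types;
    swap in the real exception classes when wiring this into the script.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable as exc:
            if attempt == max_attempts:
                # Controlled exit: actionable message, no raw stack trace in CI logs.
                raise SystemExit(f"validation failed after {attempt} attempts: {exc}")
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```

The jitter term keeps parallel CI jobs from retrying in lockstep against a rate-limited API.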
```yaml
- name: Validate screenshots with Claude
  id: validate
  if: ${{ (inputs.run_claude_validation == true || inputs.run_claude_validation == '') && env.HAS_API_KEY == 'true' }}
  run: |
```
2. Validation skipped on PRs 🐞 Bug ✓ Correctness
The Claude validation step is gated on inputs.run_claude_validation, but inputs is only defined for workflow_dispatch, so on pull_request runs the condition evaluates false and validation is skipped even when an API key is configured.
Agent Prompt
### Issue description
Claude validation is unintentionally skipped on `pull_request` runs because the step condition depends on `inputs.run_claude_validation`, which is only populated for `workflow_dispatch`. This defeats the stated goal of running visual validation on PRs.
### Issue Context
- The workflow triggers on `pull_request`.
- The validation step’s `if:` condition uses `inputs.run_claude_validation == true || inputs.run_claude_validation == ''`.
### Fix Focus Areas
- .github/workflows/visual-tests.yml[3-25]
- .github/workflows/visual-tests.yml[100-130]
### Suggested change (direction)
- Update the validation step `if:` to something like:
- `if: ${{ (github.event_name != 'workflow_dispatch' || inputs.run_claude_validation) && secrets.ANTHROPIC_API_KEY != '' }}`
- Update the “Skip validation notice” step condition to mirror the new logic (skip when validation disabled on dispatch OR when the secret is missing).
- Prefer checking `secrets.ANTHROPIC_API_KEY != ''` in the `if:` rather than relying on step-local env variables.
```yaml
- name: Validate screenshots with Claude
  id: validate
  if: ${{ (inputs.run_claude_validation == true || inputs.run_claude_validation == '') && env.HAS_API_KEY == 'true' }}
  run: |
    echo "Running Claude vision validation..."

    # Run validation and capture exit code
    set +e
    python scripts/validate_screenshots.py screenshots/ \
      --output screenshots/validation_report.json
    VALIDATION_EXIT_CODE=$?
    set -e

    # Store result for later steps
    if [ $VALIDATION_EXIT_CODE -eq 0 ]; then
      echo "validation_passed=true" >> $GITHUB_OUTPUT
      echo "✅ All visual validations passed!"
    else
      echo "validation_passed=false" >> $GITHUB_OUTPUT
      echo "❌ Some visual validations failed"
    fi

    exit 0  # Don't fail here, we'll check in a later step
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    HAS_API_KEY: ${{ secrets.ANTHROPIC_API_KEY != '' }}

- name: Skip validation notice
  if: ${{ inputs.run_claude_validation == false || env.HAS_API_KEY != 'true' }}
  env:
    HAS_API_KEY: ${{ secrets.ANTHROPIC_API_KEY != '' }}
  run: |
```
3. HAS_API_KEY used in `if:` 🐞 Bug ⛯ Reliability
The workflow’s if: conditions reference env.HAS_API_KEY, but HAS_API_KEY is only defined inside step env: blocks; this is fragile and can cause unexpected skipping depending on evaluation timing. Gate directly on secrets.ANTHROPIC_API_KEY != '' or set HAS_API_KEY at job level.
Agent Prompt
### Issue description
`env.HAS_API_KEY` is referenced in `if:` expressions, but `HAS_API_KEY` is only defined inside the same step’s `env:` block. This is fragile and can lead to the validation step unexpectedly skipping.
### Issue Context
The intention is: run validation only when an API key exists (and optionally when enabled by workflow input).
### Fix Focus Areas
- .github/workflows/visual-tests.yml[100-131]
### Suggested change (direction)
- Prefer:
- `if: ${{ secrets.ANTHROPIC_API_KEY != '' && (github.event_name != 'workflow_dispatch' || inputs.run_claude_validation) }}`
- If you still want `HAS_API_KEY`, define it at the job level (`jobs.visual-test.env`) so it is available in `if:` contexts consistently.
```python
# Return success if at least some screenshots were taken
successful = sum(1 for r in results.values() if r is not None)
if successful > 0:
    print(f"\n✓ {successful}/{len(results)} screenshots captured successfully")
    return 0
else:
    print("\n✗ No screenshots were captured")
    return 1
```
4. Partial capture can pass 🐞 Bug ✓ Correctness
visual_test_overlay.py exits success if it captures at least one screenshot, and validate_screenshots.py validates whatever subset it finds. CI can go green while missing scenarios, undermining regression detection.
Agent Prompt
### Issue description
The visual test harness and validator currently allow partial success:
- Screenshot capture exits 0 if *any* screenshot exists.
- Validation proceeds with whatever subset it finds.
This can produce false-green CI runs that don’t actually validate all 5 scenarios.
### Issue Context
The workflow’s stated goal is continuous regression detection across 5 scenarios.
### Fix Focus Areas
- scripts/visual_test_overlay.py[192-199]
- scripts/validate_screenshots.py[29-101]
- scripts/validate_screenshots.py[363-373]
- .github/workflows/visual-tests.yml[69-89]
### Suggested change (direction)
- In `visual_test_overlay.py`, return non-zero unless `successful == len(results)` (or unless all required test IDs succeeded).
- In `validate_screenshots.py`, verify that all keys in `VALIDATION_CRITERIA` exist in `screenshots` and treat missing IDs as failures (exit 1).
- In the workflow, optionally assert `steps.capture.outputs.screenshot_count == 5` before running validation and fail early if not.
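The completeness check in the first bullet could be sketched as follows. The scenario IDs `03_single` and `05_cleared` are assumed names for this illustration (only `01_empty`, `02_suggestions`, and `04_max_three` appear verbatim in the thread):

```python
# Hypothetical required-scenario list; swap in the real VALIDATION_CRITERIA keys.
REQUIRED_SCENARIOS = ("01_empty", "02_suggestions", "03_single",
                      "04_max_three", "05_cleared")

def completeness_exit_code(results):
    """Return 0 only when every required scenario produced a screenshot,
    so a partial capture can never turn CI green."""
    missing = [tid for tid in REQUIRED_SCENARIOS if results.get(tid) is None]
    if missing:
        print(f"✗ Missing screenshots: {', '.join(missing)}")
        return 1
    print(f"✓ All {len(REQUIRED_SCENARIOS)} screenshots captured")
    return 0
```

The validator would apply the same rule in reverse: treat any `VALIDATION_CRITERIA` key with no matching screenshot as a failure rather than silently validating the subset it found.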
Fixes from PR #8 review:

CI/Workflow fixes:
- Add libdbus-1-dev to system dependencies
- Install Python deps without dbus extra to avoid build failures
- Change QT_QPA_PLATFORM from offscreen to xcb for xvfb compatibility
- Fix workflow condition to check secrets directly instead of env.HAS_API_KEY
- Require all 5 screenshots (fail if incomplete)
- Upload artifacts even on failure (if: always())

Screenshot capture fixes:
- Fix key naming inconsistency (02_suggestions consistently)
- Add 4th suggestion to test truncation (4 input → 3 displayed)
- Require ALL screenshots for success exit code
- Reorder screenshot tools (scrot first for xvfb compatibility)

Claude validation fixes:
- Add proper error handling with retries and exponential backoff
- Handle APIConnectionError, RateLimitError, APIStatusError separately
- Extract text from all content blocks (not just first)
- Select most recent screenshot when duplicates exist
- Add app context to validation prompt
- Update truncation test to check for Meta+Down absence

Development tooling:
- Add justfile with common dev commands
- Add screenshots/ to .gitignore
- Create docs/HOOKS.md with proposed Claude Code hooks

Co-Authored-By: Claude <noreply@anthropic.com>
Pull request overview
Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.
```markdown
---

**Last Updated**: 2025-01-07
```
The "Last Updated" date is set to "2025-01-07", which is in the past relative to the current date (February 14, 2026). Consider updating this to the current date or removing the static date in favor of using Git commit timestamps for tracking when documentation was last modified.
```diff
- **Last Updated**: 2025-01-07
+ **Last Updated**: See Git history for modification dates.
```
```yaml
- name: Validate screenshots with Claude
  id: validate
  if: ${{ (github.event_name == 'workflow_dispatch' && inputs.run_claude_validation == true) || (github.event_name == 'pull_request' && secrets.ANTHROPIC_API_KEY != '') }}
```
The condition checking secrets.ANTHROPIC_API_KEY != '' will always evaluate to false when the secret doesn't exist, as GitHub Actions treats non-existent secrets as empty strings. However, the expression secrets.ANTHROPIC_API_KEY != '' in a conditional context actually checks if the secret has a non-empty value, which is the intended behavior. This is correct, but for clarity and better practice, consider using secrets.ANTHROPIC_API_KEY alone, which evaluates to true if the secret exists and has a value.
```diff
- if: ${{ (github.event_name == 'workflow_dispatch' && inputs.run_claude_validation == true) || (github.event_name == 'pull_request' && secrets.ANTHROPIC_API_KEY != '') }}
+ if: ${{ (github.event_name == 'workflow_dispatch' && inputs.run_claude_validation == true) || (github.event_name == 'pull_request' && secrets.ANTHROPIC_API_KEY) }}
```
```yaml
if: always()
uses: actions/upload-artifact@v4
with:
  name: overlay-screenshots-${{ github.sha }}
```
The artifact name uses ${{ github.sha }} which will create a very long artifact name (e.g., "overlay-screenshots-abc123def456..."). GitHub Actions allows artifact names up to 256 characters, so this should work, but consider using a shorter suffix like ${{ github.run_number }} or ${{ github.run_id }} for better readability in the artifacts list, or simply "overlay-screenshots" if uniqueness per run isn't critical (newer uploads will overwrite older ones with the same name).
```diff
- name: overlay-screenshots-${{ github.sha }}
+ name: overlay-screenshots-${{ github.run_number }}
```
Documentation updates:
- README.md: Add visual testing section, justfile commands, workflow badge
- visual-test-checklist.md: Fix QT_QPA_PLATFORM from offscreen → xcb, clarify truncation test
- PR-05-overlay-checklist.md: Add automated testing deliverables, update task status
- docs/updates/README.md: Add 2026-01-07 status entry for visual testing work

Consistency fixes:
- Changed all references from offscreen to xcb platform
- Clarified test 04_max_three tests truncation (4 input → 3 displayed)
- Added missing scrot/imagemagick requirements to troubleshooting
- Updated project status to reflect completed automated testing

No functionality changes - documentation only.
Summary
This PR introduces an automated visual testing system for the overlay UI that captures screenshots and validates them using Claude's vision capabilities. The system integrates with GitHub Actions to run on every PR that touches overlay code, providing continuous visual regression detection.
Key Changes
- scripts/visual_test_overlay.py: New test harness that launches the overlay in demo mode and captures screenshots across 5 test scenarios
- scripts/validate_screenshots.py: New validation script that sends screenshots to Claude API for automated visual validation against defined criteria
- .github/workflows/visual-tests.yml: New GitHub Actions workflow
- docs/plans/visual-test-checklist.md: Documentation

Implementation Details
https://claude.ai/code/session_01JuVGRqFXwACVZKofjYoJin